Averaging Rules and Adjustment Processes:

The Role of Averaging in Inference

In the many years that psychologists have been studying human judgment
processes, no two findings have emerged with greater empirical support than
these: (1) Human judgments often appear to follow an averaging rule (Anderson,
1974), and (2) Judgments in Bayesian inference tasks are usually conservative
relative to optimal judgments (Edwards, 1968).

Although these findings originated from two quite different research
traditions (cf. Slovic & Lichtenstein, 1971), there is now considerable evidence
that "averaging" and "conservatism" are not unrelated phenomena. In the present
paper I discuss this evidence and describe a process model that attempts to
explain why averaging occurs in the Bayesian task. Then I present two
experiments that test an ordering prediction for the Bayesian task drawn from this
process account. Last, I discuss the averaging model in general and sketch its
relationship to other kinds of algebraic judgment rules.


Bayesian  Inference  and  Averaging 


The Bayesian inference task is usually instantiated in terms of the
"bookbag and poker chips paradigm" in which there are two well-specified
hypotheses to be considered by the subject, usually involving populations of
binary events. For example, there may be two bookbags, one containing 70
red poker chips and 30 blue poker chips (the "red bag") and another containing
30 red chips and 70 blue chips (the "blue bag"). In most experiments the
experimenter ostensibly selects one of these bags at random and then draws
samples of one or more chips from it one or more times. These samples are
shown to the subject, usually sequentially, and the subject is asked to indicate
the strength of his belief about which bookbag was sampled using a rating scale
of some kind (i.e., probability, odds, log-odds, etc.).

According to Bayes' theorem, the optimal response in such situations is
found by multiplying the prior odds ratio for the two hypotheses ($\Omega_0$) by the
likelihood ratio of the sample given the two hypotheses (LR) to obtain the
posterior odds ratio for the two hypotheses ($\Omega_1$):

$$\Omega_1 = LR \cdot \Omega_0 \qquad (1)$$

If more than one sample of data is given, the procedure is simply applied
iteratively; the posterior odds ratio following sample n becomes the prior
odds ratio for sample n+1:

$$\Omega_{n+1} = LR_{n+1} \cdot \Omega_n \qquad (2)$$
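
The iterative odds revision described above can be illustrated with a minimal Python sketch for the 70/30 versus 30/70 bookbag example. The function names and the use of exact fractions are assumptions made for illustration only; they are not part of the original paper.

```python
from fractions import Fraction

def likelihood_ratio(red, blue,
                     p_red_h1=Fraction(7, 10), p_red_h2=Fraction(3, 10)):
    """P(sample | red bag) / P(sample | blue bag).

    The binomial coefficient is the same under both hypotheses and cancels.
    """
    num = p_red_h1**red * (1 - p_red_h1)**blue
    den = p_red_h2**red * (1 - p_red_h2)**blue
    return num / den

def posterior_odds(prior_odds, samples):
    """Apply the odds form of Bayes' theorem once per (red, blue) sample."""
    odds = Fraction(prior_odds)
    for red, blue in samples:
        odds = likelihood_ratio(red, blue) * odds
    return odds

# Equal prior odds, then a 5-red/3-blue sample followed by a 7-red/1-blue sample.
after_first = posterior_odds(1, [(5, 3)])
after_both = posterior_odds(1, [(5, 3), (7, 1)])
print(after_first, after_both)
```

Because the two bags are symmetric, each likelihood ratio reduces to (7/3) raised to the difference between the red and blue counts, so the order of the samples cannot affect the final odds.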


In  general,  human  responses  to  the  Bayesian  task  are  conservative  relative 
to  Bayesian  responses;  that  is,  human  responses  fall  nearer  neutral  than 
Bayesian responses. Although various experimental manipulations can be used
to influence the size of the conservatism effect (e.g., size of sampling
unit,  proportion  of  predominant  chips  in  the  population,  response  scale,  payment 
for  accuracy,  and  so  forth),  no  simple  manipulation  has  been  successful  in 
eliminating  conservatism  (but  see  Ells,  Seaver,  &  Edwards,  1977,  to  be 
discussed  below) . 

Three  conceptually  distinct  sorts  of  explanations  have  been  given  for 
the  conservatism  effect.  Misperception  explanations  locate  the  error  in  the 
process  through  which  subjects  estimate  the  likelihood  ratio  of  the  sample 
data. In this view conservatism occurs because subjects underestimate the
diagnostic impact of data, probably due to their having overly "flat" subjective
sampling distributions, especially for large samples. Misaggregation
explanations locate the error in the process through which subjects integrate the
information from multiple samples into composite responses. Subjects ought
to multiply, but somehow they don't. Response bias explanations treat
conservatism as an artifact of the response scale that is engendered primarily by
a tendency of subjects to avoid using extreme odds or probability judgments.

Although the Bayesian research literature has tended to treat the
misperception, misaggregation, and response bias hypotheses as competitors,
all three sources of error are likely to occur in Bayesian tasks. But
misaggregation is probably the most important theoretically and practically
because misaggregation appears to figure more prominently in producing
conservatism (Edwards, 1968; Wheeler & Edwards, 1975) and also because it is
probably easier to teach people improved methods for aggregating responses
than it is either to improve their subjective impressions of the diagnostic
impact of sample data or to remove their bias against using extreme
responses (Ells, Seaver, & Edwards, 1977).

Averaging and conservatism. The first evidence to link averaging and
conservatism came from experiments that showed that when subjects were asked
to rate the probability that samples had been drawn from one of two
statistically well-specified populations, their ratings were more often like estimates
of the population proportion than they were like inferences from Bayes' rule
(Beach, Wise, & Barclay, 1970; Marks & Clarkson, 1972, 1973; Shanteau, 1970,
1972). Shanteau hypothesized that subjects' behavior in such tasks could be
modeled by an algebraic judgment rule in which the response R at serial
position n is given by a weighted average of the scale values of the previous
and current sample events:

$$R_n = \sum_{i=0}^{n} w_i s_i \qquad (3)$$

In this equation the $s_i$ are the values of the various stimuli and the $w_i$
are weights that sum to unity. The term $w_0 s_0$ signifies the weight and value
of a neutral "initial impression." It should be noted that the averaging rule
is conservative relative to the Bayesian rule since averages always lie within
the range of their component stimulus values whereas Bayesian inferences must
often be more extreme than any of their component values.
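
The contrast between the two rules can be made concrete with a small Python sketch. The scale values (single-sample probabilities of .7 and .8 plus a neutral initial impression of .5) and the equal weights are hypothetical numbers chosen only for illustration; they are not data from the studies cited.

```python
def averaging_response(scale_values, weights):
    """Weighted average of scale values, with weights normalized to sum to one."""
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scale_values))

def bayes_combine(probabilities):
    """Combine independent single-sample probabilities (equal priors)
    by multiplying their odds, then convert back to a probability."""
    odds = 1.0
    for p in probabilities:
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)

# Hypothetical values: neutral initial impression, then two samples.
values = [0.5, 0.7, 0.8]
weights = [1, 1, 1]

avg = averaging_response(values, weights)
bayes = bayes_combine(values[1:])
print(round(avg, 3), round(bayes, 3))
```

Under these assumed numbers the average stays inside the range of its components, while the multiplicative combination is more extreme than either sample value taken alone.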

The  averaging  model  does  a  generally  good  job  of  fitting  subjects' 
inference judgments quantitatively. But even better evidence for the model
comes  from  later  experiments  by  Shanteau  (1975)  and  by  Troutman  and  Shanteau 
(1977)  which  show  that  presentation  of  neutral  or  non-diagnostic  information 
causes  subjects  to  revise  their  judgments  toward  neutral.  This  result  is 
exactly what would be expected if subjects average the neutral information
together  with  prior  non-neutral  information.  But  it  is  not  allowed  by  Bayes' 
theorem,  which  specifies  that  neutral  information  ought  to  have  no  impact  on 
prior  judgments.  That  is,  since  the  likelihood  ratio  for  neutral  information 
is  one  by  definition,  a  judgment  that  follows  neutral  information  ought  (by 
Equation  2)  to  be  numerically  identical  with  the  preceding  judgment. 
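
The differing treatment of neutral information can be sketched as follows. This is a minimal illustration, assuming a probability response scale, a prior judgment of .8, and an averaging weight of .5 for the new item; all three are hypothetical choices rather than values from the cited experiments.

```python
def bayes_update(prob, lr):
    """Revise a probability by multiplying its odds by a likelihood ratio."""
    odds = prob / (1.0 - prob) * lr
    return odds / (1.0 + odds)

def averaging_update(prob, new_value, weight_new=0.5):
    """Revise a judgment by averaging it with the value of the new item."""
    return (1.0 - weight_new) * prob + weight_new * new_value

prior = 0.8      # hypothetical judgment after earlier, non-neutral evidence
neutral = 0.5    # scale value of a non-diagnostic item (likelihood ratio = 1)

bayes_after = bayes_update(prior, 1.0)
average_after = averaging_update(prior, neutral)
print(round(bayes_after, 3), round(average_after, 3))
```

The Bayesian revision leaves the judgment where it was, whereas the averaged judgment moves toward the neutral point, which is the pattern reported for subjects.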

A  different  sort  of  evidence  linking  averaging  and  conservatism  comes 
from  experiments  by  Ells,  Seaver,  and  Edwards  (1977)  that  were  aimed  at  the 
practical  problem  of  aiding  human  performance  in  the  Bayesian  task.  These 
authors  hypothesized  that  since  untutored  subjects  appear  to  average  when  they 
ought  to  multiply,  they  might  be  better  at  judging  the  mean  log  likelihood 
ratio for a set of samples than they are at judging the cumulative log likelihood
ratio, which is the more typical Bayesian judgment.

The hypothesis was tested using two groups of subjects. One group rated
"the average, rather than the total of, their certainty" (Ells et al., 1977,
p. 6) and the other group rated their total or cumulative certainty. Then
responses from both groups were transformed to log posterior odds form and
individual regression analyses were performed for each subject comparing
inferred log posterior odds to veridical log posterior odds. The analyses
confirmed the hypothesis; although log odds inferred from average certainty
judgments were now slightly radical, they were definitely closer to veridical
than odds inferred from cumulative certainty judgments.

Averaging  and  serial  adjustment  processes.  Although  averaging  rules 
have  often  been  successful  in  accounting  for  judgment  data  quantitatively 
(cf.  Anderson,  1974),  little  attention  has  been  directed  at  finding  out  why 
averaging  occurs  qualitatively.  The  present  research  is  based  on  the 
assumption  that  averaging  results  from  the  intrinsically  serial  nature  of 
multiattribute  judgment,  both  for  tasks  in  which  the  information  is  actually 
presented serially (such as the typical Bayesian task) and for tasks in which
the information is presented simultaneously but processed serially (such as
the typical impression formation task). In either case, averaging is
hypothesized to occur because subjects adopt an adjustment strategy in which
they integrate new information into "old" composite judgments by adjusting
the  old  composite  value  upward  or  downward  as  necessary  to  make  the  "new" 
composite  lie  somewhere  between  the  "old"  composite  and  the  value  of  the  new 
information  (Lopes  &  Johnson,  in  press;  Lopes  &  Oden,  1980;  see  also  the 
research  on  anchoring  and  adjustment  by  Tversky  &  Kahneman,  1974).  Such  a 
process would be equivalent qualitatively to weighted averaging. But subjects
would never need "compute" an average in any intentional sense of that term.
Rather,  averaging  would  simply  emerge  as  a  natural  consequence  of  their 
adjustment  strategy. 

It  is  clear  that  any  judgment  process  yielding  averages  will  differ 
quantitatively  from  Bayes'  theorem.  But  the  hypothesized  adjustment  process 
also differs qualitatively in a way that can be tested experimentally. To
illustrate  this  qualitative  difference,  consider  a  subject  who  is  asked  to 
produce  serial  judgments  about  whether  samples  have  been  drawn  from  a  70/30 
red/blue  bookbag  or  a  30/70  red/blue  bookbag.  Assume  that  the  rating  scale 
is set up so that increased confidence in the "predominately red" hypothesis
is associated with larger numbers. If the first sample favors the red bookbag
moderately strongly (e.g., 5 red and 3 blue), the subject should make an
initial  judgment  at  some  value  favoring  red.  If  the  next  sample  also  favors 
red,  but  more  strongly  (e.g.,  7  red  and  1  blue),  the  subject  should  notice 
this  difference  in  strength  and  adjust  his  judgment  upwards  towards  the  value 
of  the  second  sample.  Note  that  such  an  adjustment  is  directionally  in  accord 
not  only  with  the  averaging  rule,  but  also  with  the  Bayesian  rule.  That  is, 
since  the  likelihood  ratio  of  the  second  sample  is  greater  than  one,  the 
posterior  odds  actually  do  favor  red  more  strongly  than  the  prior  odds. 

If the two samples are reversed in order, however, so that the weaker
evidence follows the stronger, the averaging rule and the Bayesian rule make
qualitatively different predictions about the direction of the adjustment.
Under the Bayesian rule the adjustment should still be upward: although the
new sample is less favorable to red than the old, the likelihood ratio of the
sample is still greater than one, so that the posterior odds should increase.
Under  the  averaging  rule,  however,  the  adjustment  should  be  downward  (i.e., 
toward  a  more  neutral  value)  since  the  value  of  the  new  sample  is  less 
favorable  to  red  than  the  value  of  the  judgment  based  only  on  the  first  sample. 
Thus,  if  subjects  in  the  Bayesian  task  use  the  hypothesized  adjustment 
strategy,  errors  should  be  evident  in  the  direction  of  revisions  when  weaker 
evidence  favoring  a  particular  hypothesis  follows  stronger  evidence  favoring 
the  same  hypothesis. 
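
The directional predictions for the 5-red/3-blue and 7-red/1-blue samples can be sketched in Python. The sketch assumes, purely for illustration, that the subjective value of a sample is its proportion of red chips and that the first response equals the value of the first sample; the function names are likewise hypothetical.

```python
import math

def bayes_direction(sample, p_red=0.7):
    """Direction in which the Bayesian posterior for 'red' moves after a sample.

    For symmetric 70/30 vs. 30/70 bags the log likelihood ratio is
    (red - blue) * log(p_red / (1 - p_red)).
    """
    red, blue = sample
    log_lr = (red - blue) * math.log(p_red / (1.0 - p_red))
    if log_lr > 0:
        return "up"
    return "down" if log_lr < 0 else "none"

def adjustment_direction(first, second):
    """Direction predicted by serial adjustment toward the new sample's value.

    Assumption: a sample's value is its proportion of red chips, and the
    response after the first sample equals that sample's value.
    """
    old = first[0] / sum(first)
    new = second[0] / sum(second)
    if new > old:
        return "up"
    return "down" if new < old else "none"

weak, strong = (5, 3), (7, 1)
print("weak-strong:", bayes_direction(strong), adjustment_direction(weak, strong))
print("strong-weak:", bayes_direction(weak), adjustment_direction(strong, weak))
```

The two rules agree when the stronger sample comes second and disagree when it comes first, which is the ordering prediction tested in the experiments.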

Experimental Tests of the Directional Hypothesis

Two experiments were run to test the hypothesis that subjects in the
Bayesian  inference  task  revise  their  posterior  inferences  in  the  normatively 
wrong  (i.e.,  neutral)  direction  when  a  sample  strongly  favoring  a  given 
hypothesis  is  followed  by  a  weaker  sample  favoring  the  same  hypothesis. 

Since  the  experiments  were  essentially  identical  except  for  the  stimulus 
designs,  they  are  discussed  together. 

Method 

Experimental  tasks.  Subjects  in  both  experiments  were  asked  to  put 
themselves  in  the  place  of  a  machinist  whose  job  was  to  make  decisions  con¬ 
cerning  the  maintenance  of  complex  milling  machines  based  on  samples  of  parts 
produced  by  the  machines.  Subjects  in  Experiment  1  were  asked  to  make  judg¬ 
ments  about  whether  machines  needed  maintenance  or  not.  They  were  instructed 
that  machines  which  were  working  properly  were  about  as  likely  to  produce 
parts  that  were  a  little  too  large  as  to  produce  parts  that  were  a  little  too 
small.  Broken  machines,  on  the  other  hand,  were  described  as  tending  to 
produce  parts  of  which  about  75%  were  a  little  too  large  and  25%  were  a  little 
too  small.  Thus,  in  abstract  terms,  subjects  in  Experiment  1  were  asked  to 
decide between the hypothesis 50% large/50% small (H50/50) and the hypothesis
75% large/25% small (H75/25).

The task for Experiment 2 was similar except that subjects were asked to
judge which of two maintenance procedures a machine needed. They were
instructed that machines needing one procedure tended to produce parts of
which about 75% were a little too small and 25% were a little too large.
Machines needing the other procedure, however, were described as tending to
produce parts of which about 75% were a little too large and 25% were a
little too small. Thus, subjects in Experiment 2 decided between the hypothesis
25% large/75% small (H25/75) and the hypothesis 75% large/25% small (H75/25).

Stimulus designs. The stimulus design for Experiment 1 was a 7 x 7,
first-sample x second-sample, factorial design in which the levels of both
factors comprised the same seven sample distributions of large and small parts.
The distributions were, for large and small parts, respectively: 3/7, 4/6,
5/5, 6/4, 7/3, 8/2, and 9/1. The design for Experiment 2 was also a
first-sample x second-sample factorial design, but with nine levels on each factor.
These levels were, for large and small parts, respectively: 1/9, 2/8, 3/7,
4/6, 5/5, 6/4, 7/3, 8/2, and 9/1.

Procedure.  Subjects  were  run  individually  in  sessions  that  took  about 
45  minutes  for  Experiment  1  and  about  60  minutes  for  Experiment  2.  At  the 
beginning of the session subjects were brought into a soundproof booth and
seated in front of a computer-controlled video terminal. Subjects were then
given  general  instructions  about  the  nature  of  the  task  and  shown  how  to  read 
the  stimulus  display.  A  sample  of  a  stimulus  display  for  Experiment  1  is  shown 
in Figure 1. At the top of the display is a box showing a sample of parts,
7 large and 3 small. Under the box is a notation showing that this is the
first sample. At the bottom of the display is a response scale anchored at
the left by the words "MILLING NORMALLY (50%)" and at the right by the words
"MILLING TOO LARGE (75%)." The display for Experiment 2 was identical except
that  the  response  scale  was  anchored  with  "MILLING  TOO  SMALL  (75%  SMALL)"  at 
the  left  and  "MILLING  TOO  LARGE  (75%  LARGE)"  at  the  right. 

Figure  1  about  here 

The procedure for each trial was identical. Subjects read the information
for the first sample and then rated the degree of their belief as to whether
the machine was milling normally or not (in Experiment 1) or whether the
machine was milling too large or too small (in Experiment 2). They made their
ratings using a hand-held response device to move the rating arrow (shown in
the  middle  of  the  scale  in  Figure  1)  along  the  response  scale.  When  they 
finished  their  initial  rating,  subjects  pushed  a  button  on  the  response  device. 
This  caused  the  rating  to  be  transmitted  to  the  computer  and  also  caused  the 
first  sample  to  be  erased  and  replaced  by  a  second  sample  of  parts  from  the 
same  machine.  Subjects  then  revised  their  initial  rating  to  account  for  the 
new  sample  and  pushed  the  response  button  to  transmit  their  final  response 
to  the  computer.  Finally  subjects  initialized  the  next  trial  by  returning 
the  response  arrow  to  the  middle  of  the  scale. 

In  both  experiments  special  precautions  were  taken  to  make  sure  that 
subjects  understood  the  judgment  task.  In  particular,  the  instructions 
emphasized  that  subjects  should  consider  that  the  two  samples  within  a  trial 
were  drawn  independently  from  the  same  machine,  and  that  samples  from 
different  trials  were  drawn  from  different  machines.  Subsequent  debriefing 
of  the  subjects  indicated  that  all  of  them  had  understood  these  instructions. 

Subjects  in  both  experiments  were  given  15  trials  for  practice  and  then 
were  run  through  two  replications  of  the  stimulus  design.  Experimental 
trials  within  each  replication  were  ordered  randomly,  but  with  the  restriction 
that no sample appear as either first-sample or second-sample on two
consecutive trials.

Subjects. The subjects for the two experiments were 41 and 39 student
volunteers, respectively, who served for credit to be applied to their course
grades in introductory psychology. In Experiment 1 subjects were all males;
in Experiment 2 subjects were approximately evenly divided between the sexes.

Results  and  Discussion 

Ratings of single samples. In order to test the adjustment hypothesis,
it is necessary to determine at least roughly what subjective values the subjects
attached  to  the  various  sample  types.  This  can  be  done  by  looking  at  the 
responses  subjects  gave  to  the  first  sample  of  each  stimulus  pair.  The  data 
are  given  in  Table  1,  averaged  over  both  subjects  and  replications.  Ratings 
have  been  scaled  to  run  between  0  and  1. 

Table  1  about  here 

For reasons that will be clear shortly, it is best to begin with Experiment
2. Basically, the results are very simple: ratings of the likelihood of H75/25
increased essentially linearly from 1/9 samples to 9/1 samples with 5/5 samples
being rated as neutral. In fact, the close numerical correspondence between
the ratings and the proportion of large parts in the samples suggests that
subjects probably produced these initial ratings by the simple expedient of using
the sample proportion as a judgmental "anchor" (Tversky & Kahneman, 1974). The
results of Experiment 1 are more difficult to understand. In gross terms, there
is no problem: samples of 3/7 through 6/4 were rated as supporting H50/50 and
samples of 7/3 through 9/1 were rated as supporting H75/25. But the data
deviated from the norm ordinally in that 5/5 samples were judged to be more
supportive of H50/50 than either 3/7 or 4/6 samples.

Inspection of single subject data for Experiment 1 revealed clear individual
differences in how subjects evaluated these samples. Most subjects (38 out of
41) could be assigned to one of three groups. The first group (15 subjects)
ordered the samples appropriately, with 3/7 samples being taken as stronger
evidence for H50/50 than 4/6 samples and these in turn as stronger evidence than
5/5 samples. These subjects will be called the "likelihood ratio" group since
they appeared to judge the value of sample evidence in terms of the relative
degree to which it supported the two hypotheses. The second group (12 subjects)
ordered the samples in exactly inverse order to the norm: 5/5 samples were taken
to be the strongest evidence of H50/50, followed by 4/6 and 3/7 samples in that
order. These subjects appeared to judge the samples according to how
representative they were of a 50/50 generating process (Kahneman & Tversky, 1972). They
will therefore be called the "representativeness" group. The third group (11
subjects) ordered the samples 5/5, 3/7, 4/6. This "mixed" group appeared to be
influenced by representativeness only if samples were "perfectly" representative.
Otherwise they seemed to rely on relative likelihood considerations.

The  fact  that  many  subjects  in  Experiment  1  appeared  to  prefer  using 
representativeness  rather  than  relative  likelihood  as  the  basis  for  evaluating 
samples  causes  no  problems  for  testing  the  adjustment  hypothesis  other  than 
making  it  necessary  to  perform  separate  tests  for  the  various  groups.  But  the 
result is generally problematical since misorderings of sample data have not
been reported previously in the Bayesian literature and also since the effect
did not occur either for Experiment 2 or for the samples supporting H75/25
in Experiment 1. It is therefore worth considering why the error occurred when
it  did. 

One  possibility  is  that  subjects  may  have  believed  mistakenly  that  normal 
machines  always  produce  50/50  samples.  But  this  is  unlikely  since  subjects  were 
instructed quite explicitly that normal machines do not always produce exactly
50% large parts, but sometimes produce more and sometimes less. A better
possibility seems to be that the task violated the conventional semantics of
what it means for a machine to be working normally. Certainly it is a bit odd
linguistically to say that the best evidence for the normal functioning of a
machine is that it produces a sample with an abnormally large number of small
parts. Thus, subjects may have slipped unintentionally into using a hybrid
strategy in which samples favoring H75/25 were evaluated with respect to both
hypotheses whereas samples favoring H50/50 were evaluated only with respect to
H50/50. Of course, there is no way to be sure that this is what happened.
But if the explanation is correct, errors of this sort should not be found when
the task of choosing between H50/50 and H75/25 is framed more neutrally, as in
the conventional "bookbags and poker chips" format.

Adjustments  for  second  samples.  The  adjustment  hypothesis  states  that  when 
subjects  are  given  two  pieces  of  evidence  that  favor  the  same  hypothesis  but  to 
different degrees (hereafter called a "homogeneous sample pair"), their judged
confidence in the hypothesis will increase if the stronger sample follows the
weaker, but will decrease if the weaker sample follows the stronger. Bayes'
theorem,  in  contrast,  specifies  that  confidence  should  increase  independently  of 
the  order  of  the  samples.  Thus,  if  the  adjustment  hypothesis  is  correct, 
directional  errors  in  revision  should  occur  more  frequently  for  strong-weak 
sample  orders  than  for  weak-strong  sample  orders. 
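
To show how ordered pairs in the Experiment 1 design fall into these categories, the following Python sketch classifies all 49 first-sample x second-sample pairs by their Bayesian likelihood ratios for H75/25 against H50/50. The labels, the function names, and the measure of strength (distance of the likelihood ratio from one) are assumptions introduced for illustration; the sketch describes the normative classification, not subjects' behavior.

```python
from fractions import Fraction
from itertools import product

# (large, small) sample distributions used in Experiment 1.
SAMPLES = [(3, 7), (4, 6), (5, 5), (6, 4), (7, 3), (8, 2), (9, 1)]

def likelihood_ratio(large, small):
    """P(sample | H75/25) / P(sample | H50/50); binomial coefficients cancel."""
    h75 = Fraction(3, 4)**large * Fraction(1, 4)**small
    h50 = Fraction(1, 2)**(large + small)
    return h75 / h50

def classify(first, second):
    """Label an ordered pair of samples relative to the Bayesian norm."""
    lr1, lr2 = likelihood_ratio(*first), likelihood_ratio(*second)
    if (lr1 > 1) != (lr2 > 1):
        return "heterogeneous"          # samples favor different hypotheses
    if lr1 == lr2:
        return "identical"
    # Strength of evidence: how far the likelihood ratio lies from 1.
    s1, s2 = max(lr1, 1 / lr1), max(lr2, 1 / lr2)
    return "weak-strong" if s2 > s1 else "strong-weak"

counts = {}
for first, second in product(SAMPLES, repeat=2):
    label = classify(first, second)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

For every homogeneous pair the Bayesian posterior moves further toward the favored hypothesis after the second sample in either order, so only the strong-weak pairs separate the two accounts.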

Table  2  gives  the  proportions  of  directionally  correct  adjustments  (relative 
to  the  Bayesian  norm)  for  strong-weak  sample  orders  and  weak-strong  sample  orders. 
These  are  pooled  over  sample  pair,  subjects,  and  replications.  Since  the  subjects 
in  Experiment  1  disagreed  about  the  ordering  of  the  sample  evidence,  their  data 
have  been  divided  into  subgroups. 

Table  2  about  here 

The  basic  result  is  clear  in  all  four  comparisons:  the  data  support  the 
adjustment  hypothesis.  Adjustments  were  almost  always  made  in  the  correct 
direction  for  weak-strong  sample  pairs  and  almost  never  made  in  the  correct 
direction for strong-weak sample pairs, p < .01 for all group differences.

Recency  effects.  The  adjustment  data  above  are  qualitative  in  the  sense 
that  they  do  not  show  whether  differences  in  sample  order  produced  differences 
in  the  magnitudes  of  the  final  responses.  Under  the  adjustment  hypothesis, 
however,  any  of  three  quantitative  effects  might  occur:  (1)  adjustments  might 
be  exactly  sufficient  to  produce  a  true  arithmetic  "running  average"  of  the 
sample evidence in which case the final response data would show no effect of
order; (2) adjustments might be insufficient for arithmetic averaging in which
case final responses would be closer numerically to the first sample than
ought to be the case, constituting a judgmental primacy effect; (3) adjustments
might be oversufficient for arithmetic averaging in which case final responses
would  be  closer  numerically  to  the  second  sample  than  they  ought,  constituting 
judgmental  recency.  In  either  of  the  latter  two  situations  order  effects 
would  occur  for  corresponding  strong-weak  and  weak-strong  sample  pairs. 
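
The three possibilities can be expressed with a single adjustment weight, as in the following Python sketch. It assumes, for illustration only, that the first response equals the value of the first sample and that the final response moves a fraction w of the way toward the value of the second sample; the sample values of .6 and .9 are hypothetical.

```python
def final_response(first_value, second_value, w):
    """Adjust the first response a fraction w of the way toward the second value."""
    return first_value + w * (second_value - first_value)

def order_effect(a, b, w):
    """Final response for order (a, b) minus final response for order (b, a)."""
    return final_response(a, b, w) - final_response(b, a, w)

weak, strong = 0.6, 0.9   # hypothetical scale values of two samples

exact = order_effect(weak, strong, 0.5)     # true running average
primacy = order_effect(weak, strong, 0.3)   # insufficient adjustment
recency = order_effect(weak, strong, 0.7)   # oversufficient adjustment
print(round(exact, 3), round(primacy, 3), round(recency, 3))
```

Algebraically the order effect equals (b - a)(2w - 1), so it vanishes only at w = .5, is negative for weak-strong orders when w is smaller (primacy), and is positive when w is larger (recency).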

Final  response  data  for  homogeneous  sample  pairs  are  given  in  Table  3 
organized  according  to  the  relative  strength  of  the  two  samples.  The  data  for 
Experiment  1  have  been  broken  down  according  to  how  subjects  ordered  the  sample 
types.  For  likelihood  ratio  subjects  (but  not  for  representativeness  subjects 
or  mixed  subjects)  there  was  a  significant  recency  effect  that  appeared  as  a 
tendency  for  final  responses  to  be  more  extreme  when  samples  were  ordered 
weak-strong than when they were ordered strong-weak, F(1,14) = 6.91, p < .05.
That is, for pairs that favored H50/50, weak-strong responses were smaller
than strong-weak responses and for pairs that favored H75/25, weak-strong
responses were larger than strong-weak responses. Thus, more extreme final
responses were generated when the more extreme of two samples that favored the
same hypothesis was presented in the second position. There was also a
prominent recency effect for homogeneous pairs in Experiment 2, F(1,38) =
39.77, p < .001.

Table  3  about  here 

Final responses for heterogeneous sample pairs are given in Table 4.
Since the samples in these pairs favored different hypotheses, the data have
been organized according to whether the ordered pair required adjustment toward
H50/50 (leftward adjustment) or toward H75/25 (rightward adjustment). For
both experiments there was a strong recency effect manifested by a tendency
for pairs requiring leftward adjustment to produce smaller final responses
than equivalent pairs requiring rightward adjustment: F(1,40) = 10.40,
p < .01 for Experiment 1 and F(1,38) = 27.89, p < .001 for Experiment 2.
In other words, pairs in which the two samples favored different hypotheses
produced final responses that favored H50/50 more strongly (i.e., were nearer
0) if the H50/50 sample was in the second position, and favored H75/25 more
strongly (i.e., were nearer 1) if the H50/50 sample was in the first position.
In fact, the effect was so strong in Experiment 2 that final responses for
corresponding pairs tended to lie on opposite sides of the neutral point
of the response scale.

Table  4  about  here 

Recency effects have been reported previously in the Bayesian literature
(Pitz & Reinhold, 1968; Shanteau, 1970, 1972) with the point being stressed that
order effects are not allowed by the Bayesian model. But order effects are also
important for what they suggest about the underlying representation and use of
task information during the judgment process. In the present case, the tendency
for subjects in Experiment 2 to make their final responses lie on the side of
the response scale favored by the second sample suggests that subjects may have
represented task information internally in such a way that they were unable to
appreciate simple quantitative relationships between successive samples. For
example, for the pair (1/9, 8/2) subjects initiated their responses at an
average value of .09 following the 1/9 sample and then adjusted to .59
following the 8/2 sample. Thus, subjects expressed mild confidence in H75/25.
But subjects ought to have favored H25/75 since the sample favoring H25/75
was more extreme than the sample favoring H75/25.

Why  do  such  errors  occur?  Certainly  they  would  not  be  expected  if  subjects 
were assumed to compute running sample differences. But if subjects concentrate
on adjusting for the difference between the new sample and the old response,
they may be unlikely to compare the samples directly and hence may fail to
notice which of the samples should dominate the response. In other words, once
stimulus information has been transformed into response mode, all further
processing of and adjustments to the interim response may be based solely on that
transformed value rather than on some more literal representation of the raw
stimulus information.


Discussion 


The  experiments  reported  here  demonstrate  that  the  adjustment  processes 
used  by  subjects  in  the  Bayesian  task  are  consistent  with  an  averaging  rule  and 
inconsistent  with  the  multiplicative  rule  specified  by  Bayes'  theorem.  In
particular,  the  present  experiments  supplement  the  earlier  work  of  Shanteau 
(1975)  and  Troutman  and  Shanteau  (1977)  by  demonstrating  qualitative  adjustment 
errors  for  diagnostic  samples.  But  the  primary  purpose  of  the  paper  is  not  so 
much  to  support  the  averaging  rule,  as  it  is  to  suggest  the  psychological 
processes  that  produce  averaging. 

Adjustment  Mechanisms  and  Algebraic  Judgment 

The main thesis of the present paper is that subjects in Bayesian inference
tasks average when they ought to multiply because they produce their judgments
by using a serial adjustment strategy that just happens to be equivalent to
averaging. But subjects in judgment tasks don't always average. In fact, there
are some tasks such as judging the worth of gambles (Anderson & Shanteau, 1970;
Shanteau, 1974; Tversky, 1967) and judging the likelihood of joint events
(Beach & Peterson, 1966; Lopes, 1976; Shuford, 1959) in which subjects seem to
multiply. Why do they average in one case and not the other?

The  position  taken  here  is  that  averaging  and  multiplying  reflect  basically 
similar  adjustment  strategies  operating  in  different  task  environments.  Tasks 
that  produce  averaging  usually  call  for  bidirectional  adjustments  in  which  the 
judgment  is  sometimes  increased  and  sometimes  decreased,  as  is  the  case  in  the 
Bayesian  task  for  heterogeneous  sample  pairs.  Tasks  that  produce  multiplying, 
on  the  other  hand,  usually  call  for  adjustments  which  are  unidirectional  and 
most  often  downward.  For  example,  one  can  judge  the  value  of  a  gamble  by 
beginning  with  the  value  of  the  prize  to  be  won  and  then  adjusting  downward  in 
proportion  to  the  probability  of  winning  (Lopes  &  Ekberg,  1980).  Similarly, 
one  can  judge  the  probability  of  a  joint  event  by  beginning  with  the  probability 
of  one  event  and  then  decreasing  this  in  proportion  to  the  probability  of 
the  other  event  (Lopes,  1976).  Both  processes  involve  downward  adjustment  and 
both  are  equivalent  to  multiplying. 
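
The two kinds of adjustment can be written out to show the algebraic equivalences claimed here. This is a minimal sketch: the function names, the adjustment weight, and the numerical values are hypothetical illustrations rather than quantities taken from the experiments.

```python
def adjust_toward(response, target, weight):
    """Bidirectional adjustment: move a fraction `weight` of the way to `target`.

    The result may be higher or lower than `response`, depending on `target`.
    """
    return response + weight * (target - response)


def adjust_downward(value, proportion):
    """Unidirectional adjustment: remove the fraction (1 - proportion) of `value`."""
    return value - (1.0 - proportion) * value


# Adjusting toward a new value is the same as taking a weighted average.
old, new, w = 0.20, 0.80, 0.5
averaged = adjust_toward(old, new, w)
assert abs(averaged - ((1 - w) * old + w * new)) < 1e-12

# Adjusting a prize downward for the probability of winning is multiplying.
prize, p_win = 10.0, 0.25
worth = adjust_downward(prize, p_win)
assert abs(worth - prize * p_win) < 1e-12

# Adjusting one probability downward for another gives the product rule.
p_a, p_b = 0.6, 0.5
joint = adjust_downward(p_a, p_b)
assert abs(joint - p_a * p_b) < 1e-12
```

In both cases the judge performs the same kind of operation, a proportional change to a current value. Only the direction of the change, and therefore the algebraic rule that describes the outcome, differs.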

A  basic  corollary  of  the  present  view  is  that  subjects  in  judgment  tasks 
do  not  "choose"  judgment  rules  in  the  sense  that  they  decide  how  information 
ought  to  be  combined.  Rather  they  choose  adjustment  processes  that  seem  to  "fit" 
the  task,  both  in  terms  of  the  ease  with  which  the  process  can  be  executed 
mentally  and  in  terms  of  the  degree  to  which  the  process  generates  plausible 
judgments.  Usually  this  works  out  reasonably  well.  But  now  and  then  the 
adjustment  strategy  that  seems  to  subjects  to  fit  the  task  best  will  turn  out 
to  be  normatively  inappropriate.

It is also clear that the judgment rule subjects employ for judging A from
B and C will not necessarily bear a logical relationship to the rules they
employ for judging B from A and C or C from A and B (Anderson & Butzin, 1974;
Graesser & Anderson, 1974). For example, Graesser and Anderson asked subjects
to judge income, generosity, or expected gift size from information about the
other two. The data revealed that although income and generosity were combined
multiplicatively in judgments of expected gift size, there were no corresponding
dividing relations between gift size and generosity for judgments of income, or
income and gift size for judgments of generosity. Instead, these latter judgments
seemed to follow some sort of subtracting rule. Thus, the human judgment
system appears to comprise a set of individual judgment processes that may
sometimes correspond to simple arithmetic operations, but that tend to be both
nonreversible and largely unrelated to one another.

Judgment  Processes  and  Memory  Processes 

The  present  adjustment  model  makes  two  fundamental  assumptions  about  the 
characteristics  of  the  judgment  apparatus.  First,  it  assumes  the  existence  of 
an  internal  quantitative  dimension  that  is  reasonably  continuous,  and  second, 
it  assumes  that  this  dimension  is  directly  available  for  manipulation  by  the 
judge.  Judgment  itself  is  hypothesized  to  occur  as  quantitative  information 
is  extracted  from  stimuli  one  by  one  (the  evaluation  process)  and  used  to 
modify  the  current  judgment  (the  adjustment  process). 
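
The evaluation and adjustment processes can be sketched as a simple loop. In this sketch the evaluation step is represented by the mean first-sample ratings for Experiment 2 given in Table 1, and the adjustment weight of .7 is an arbitrary illustrative value, not an estimate reported in this paper. The function and variable names are likewise hypothetical.

```python
# Mean first-sample ratings from Table 1 (Experiment 2), standing in for the
# evaluation process. Keys are (large, small) counts.
EVALUATION = {(1, 9): 0.08, (8, 2): 0.82}


def serial_judgment(samples, weight=0.7):
    """Evaluate samples one by one and adjust the current judgment each time.

    `weight` is an assumed adjustment weight chosen only for illustration.
    Only the interim response is carried forward; earlier samples are not
    re-examined.
    """
    response = None
    for sample in samples:
        value = EVALUATION[sample]        # evaluation process
        if response is None:
            response = value              # initial response to the first sample
        else:
            response += weight * (value - response)  # adjustment process
    return response


rightward = serial_judgment([(1, 9), (8, 2)])  # second sample favors the right end
leftward = serial_judgment([(8, 2), (1, 9)])   # second sample favors the left end
print(round(rightward, 2), round(leftward, 2))
```

With this assumed weight the final response is roughly .60 when the 8/2 sample comes second and roughly .30 when the 1/9 sample comes second. The response therefore falls on the side of the scale favored by the second sample, even though the two samples are the same in both orders.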

A similar model was proposed some years ago by Anderson and Hubert (1963)
to account for the fact that order effects in impression formation differ from
the effects that would be expected if the impression depended on verbal memory.
They concluded that judgments in the impression task involve "a memory process
which is distinct from, and not dependent on, the immediate verbal memory for
the adjectives just heard" (p. 386).

An important feature of the Anderson and Hubert (1963) model is that it
allows the possibility that judgment processes may sometimes be insensitive to
certain interactive relationships among stimuli. For example, a judge might
fail to notice that the current piece of stimulus information is inconsistent
or redundant with previously integrated information. Or, as may have actually
occurred for heterogeneous pairs in Experiment 2, a judge might produce a
response to a sequence of stimuli that differs qualitatively from the response
that would be made if the stimuli were retrieved from verbal memory and combined
in raw form before being submitted to the evaluation process.

Quantitative judgment is both commonplace and fundamental, and much is
known about the content and algebraic structure of judgment data. But relatively
little is known about actual judgment processes. This is true in part because
judgment processes do not yield gracefully to the sorts of experimental
manipulation that have been useful in other cognitive domains. But the present
experiments as well as those by Shanteau (1975), Troutman and Shanteau (1977), and
Anderson and Hubert (1963) suggest that the study of temporal effects in judgment
tasks such as order effects and adjustment effects may provide deeper insights
into the cognitive mechanisms that are invoked when people make judgments.


This happens, in fact, to be in accord with the actual likelihood ratios
of the samples.

Table 1

Mean Ratings of First Samples

     Experiment 1              Experiment 2

  Sample                    Sample
   (L/S)      Rating         (L/S)      Rating

    3/7        .20            1/9        .08
   [4/6]       .23            2/8        .16
    5/5        .16            3/7        .24
   [6/4]       .45            4/6        .31
    7/3        .73            5/5        .49
    8/2        .85            6/4        .67
    9/1        .94            7/3        .74
                              8/2        .82
                              9/1        .91

Note. A rating of 0 indicates complete confidence in H50/50 for Experiment 1
and H25/75 for Experiment 2. A rating of 1 indicates complete confidence in
H75/25.

Table 2

Proportion of Directionally Correct Adjustments
for Homogeneous Sample Pairs

                              Strong-    Weak-
                              Weak       Strong     F diff      (df)

Experiment 1
  Likelihood ratio group        .35        .89       32.17*     (1,14)
  Representativeness group      .30        .86       76.14*     (1,11)
  Mixed group                   .29        .92      115.74*     (1,10)

Experiment 2                    .25        .92      165.69*     (1,38)

*p < .01

Note. In Experiment 1 there were 18 strong-weak and 18 weak-strong comparisons per
subject and in Experiment 2 there were 24 strong-weak and 24 weak-strong comparisons
per subject. A list of the sample pairs is available in Table 3.

Table 4

Final Ratings of Heterogeneous Pairs
for Leftward and Rightward Adjustment

            Experiment 1                             Experiment 2

Sample      Leftward      Rightward      Sample      Leftward      Rightward
Pair        (toward       (toward        Pair        (toward       (toward
            H50/50)       H75/25)                    H25/75)       H75/25)

(3/7, 7/3)  .35     <     .46            (1/9, 6/4)  .23     <     .46
(3/7, 8/2)  .42     <     .54            (1/9, 7/3)  .26     <     .52
(3/7, 9/1)  .44     <     .64            (1/9, 8/2)  .28     <     .59
(4/6, 7/3)  .43     <     .48            (1/9, 9/1)  .32     <     .63
(4/6, 8/2)  .46     <     .58            (2/8, 6/4)  .28     <     .51
(4/6, 9/1)  .54     <     .68            (2/8, 7/3)  .31     <     .59
(5/5, 7/3)  .45           .45            (2/8, 8/2)  .36     <     .61
(5/5, 8/2)  .52     <     .60            (2/8, 9/1)  .37     <     .72
(5/5, 9/1)  .55     <     .66            (3/7, 6/4)  .35     <     .52
(6/4, 7/3)  .64           .64            (3/7, 7/3)  .37     <     .58
(6/4, 8/2)  .72     <     .77            (3/7, 8/2)  .41     <     .64
(6/4, 9/1)  .73     <     .82            (3/7, 9/1)  .44     <     .73
                                         (4/6, 6/4)  .43     <     .57
                                         (4/6, 7/3)  .45     <     .65
                                         (4/6, 8/2)  .49     <     .70
                                         (4/6, 9/1)  .51     <     .75

Note. The terms "rightward" and "leftward" refer to the end of the response
scale favored by the second sample. Leftward ratings smaller than corresponding
rightward ratings indicate recency.




Figure 1. [Figure not reproduced. Recoverable labels: "SAMPLE: 7 LARGE, 3 SMALL"; "FIRST"; "MILLING NORMALLY (50%)"; "MILLING TOO LARGE (75%)".]